{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# sklearn GBDT参数理解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## sklearn.ensemble.GradientBoostingClassifier\n",
    "\n",
    "class sklearn.ensemble.GradientBoostingClassifier(loss=’deviance’, learning_rate=0.1, n_estimators=100, subsample=1.0, criterion=’friedman_mse’, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort=’auto’, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001)    \n",
    "\n",
    "- __loss：__ 可以有两个参数可选，对数似然损失函数\"deviance\"和指数损失函数\"exponential\"。默认是对数似然损失函数\"deviance\"。在原理篇中对这些分类损失函数有详细的介绍。一般来说，推荐使用默认的\"deviance\"。它对二元分离和多元分类各自都有比较好的优化。而指数损失函数等于把我们带到了Adaboost算法。  \n",
    "- __learning_rate：__ 即每个弱学习器的权重缩减系数$v$，也称作步长，加上正则化项，我们的强学习器的迭代公式为$f_k(x)=f_{k−1}(x)+vh_k(x)$。$v$的取值范围为$0<v≤1$。对于同样的训练集拟合效果，较小的$v$意味着我们需要更多的弱学习器的迭代次数。通常我们用步长和迭代最大次数一起来决定算法的拟合效果。所以这两个参数n_estimators和learning_rate要一起调参。一般来说，可以从一个小一点的$v$开始调参，默认是1。\n",
    "- __n_estimators：__ 也就是弱学习器的最大迭代次数，或者说最大的弱学习器的个数。一般来说n_estimators太小，容易欠拟合，n_estimators太大，又容易过拟合，一般选择一个适中的数值。默认是100。在实际调参的过程中，我们常常将n_estimators和参数learning_rate一起考虑。\n",
    "- __subsample：__ 子采样，取值为(0,1]。注意这里的子采样和随机森林不一样，随机森林使用的是放回抽样，而这里是不放回抽样。如果取值为1，则全部样本都使用，等于没有使用子采样。如果取值小于1，则只有一部分样本会去做GBDT的决策树拟合。选择小于1的比例可以减少方差，即防止过拟合，但是会增加样本拟合的偏差，因此取值不能太低。推荐在$[0.5, 0.8]$之间，默认是1.0，即不使用子采样。\n",
    "- __init：__ 弱学习器，$f_0(x)$，如果不输入，则用训练集样本来做样本集的初始化分类回归预测。否则用init参数提供的学习器做初始化分类回归预测。一般用在我们对数据有先验知识，或者之前做过一些拟合的时候，如果没有的话就不用管这个参数了。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据\n",
    "\n",
    "数据集：比马印第安人糖尿病数据集（Pima Indians Diabetes Dataset）涉及根据医疗记录预测比马印第安人5年内糖尿病的发病情况。\n",
    "\n",
    "字段说明：\n",
    "1. Pregnancies：怀孕次数 \n",
    "2. Glucose：在口服葡萄糖耐受试验中，血糖浓度为2小时 \n",
    "3. BloodPressure：舒张压(mm Hg) \n",
    "4. SkinThickness：肱三头肌皮肤褶皱厚度(mm) \n",
    "5. Insulin：2小时血清胰岛素(mu U/ml) \n",
    "6. BMI：身体质量指数(公斤/重量(高度)^ 2) \n",
    "7. DiabetesPedigreeFunction：糖尿病血统函数 \n",
    "8. Age：年龄(岁) \n",
    "9. Outcome：类变量(0或1) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_data(fileName):\n",
    "    columeLabel = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome']\n",
    "    dataset = pd.read_csv(fileName, sep='\\t', header=None, index_col=False, names=columeLabel)\n",
    "    x_columns = [x for x in dataset.columns if x not in 'Outcome']\n",
    "    return dataset[x_columns], dataset['Outcome'], dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载训练数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, y_train, trainDataset = create_data('homework_trainingDataset.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 采用sklearn中的GradientBoostingClassifier进行训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import GradientBoostingClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n",
       "              learning_rate=0.5, loss='deviance', max_depth=3,\n",
       "              max_features=None, max_leaf_nodes=None,\n",
       "              min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "              min_samples_leaf=1, min_samples_split=2,\n",
       "              min_weight_fraction_leaf=0.0, n_estimators=100,\n",
       "              n_iter_no_change=None, presort='auto', random_state=None,\n",
       "              subsample=1.0, tol=0.0001, validation_fraction=0.1,\n",
       "              verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = GradientBoostingClassifier(learning_rate=0.5)\n",
    "clf.fit(np.array(X_train), np.array(y_train))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 对测试数据进行预测，并得到预测正确率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test, y_test, testDataset = create_data('homework_testDataset.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.75"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.score(X_test, y_test)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
